Abstract

Context-free grammar simplification is a subject of high importance in computer language
processing technology as well as in formal language theory. This paper presents a
formalization, using the Coq proof assistant, of the fact that general context-free grammars
generate languages that can also be generated by simpler and equivalent context-free
grammars. Namely, useless symbol elimination, inaccessible symbol elimination, unit rules
elimination and empty rules elimination operations were described and proven correct
with respect to the preservation of the language generated by the original grammar.

1 Introduction

The formalization of context-free language theory is key to the certification of
compilers and programs, as well as to the development of new languages and tools for
certified programming. The results presented in this paper are part of an ongoing
work that intends to formalize parts of the context-free language theory in the Coq
proof assistant. The initial results comprised the formalization of closure properties
for context-free grammars, namely union, concatenation and Kleene star [30].

In order to follow this paper, the reader is required to have basic knowledge
of Coq and of context-free language theory. For the beginner, the recommended
starting points for Coq are the book by Bertot [7], the online book by Pierce [15]
and a few tutorials available on [20]. Detailed information on the Coq proof assistant,
as well as on the syntax and semantics of the following definitions and statements, is
available in [12]. Background on context-free language theory can be found in [33],
[19] or [31], among others.

The objective of this work is to formalize a substantial part of context-free
language theory in the Coq proof assistant, making it possible to reason about it in a
fully checked environment, with all the related advantages. Initially, however, the
focus has been restricted to context-free grammars and associated results. Pushdown
automata and their relation to context-free grammars will be considered in
the future.

When the work is complete, it should be useful for a few different purposes.
First, to make available a complete and mathematically precise description
of the behavior of the objects of context-free language theory. Second, to offer fully
checked and mechanized demonstrations of its main results. Third, to provide a
library with basic and fundamental lemmas and theorems about context-free grammars
and derivations that can be used as a starting point to prove new theorems
and increase the amount of formalization for context-free language theory. Fourth,
to allow for the certified and efficient implementation of its relevant algorithms in a
programming language. Fifth, to permit experimentation in an educational environment
in the form of a tool set, in a laboratory where further practical observations
and developments can be made, for the benefit of students, teachers, professionals
and researchers.

The general idea of formalizing context-free language theory in the Coq proof 
assistant is discussed in Section 2. The methodology used is briefly reviewed in 
Section 3. Specific results related to the formalization of grammar simplification are 
presented in Section 4. The plan for the rest of this research is presented in Section 
5, and Section 6 considers related work by various other researchers. 

The results reported in this paper are related to the elimination of symbols 
(terminals and non-terminals) in context-free grammars that do not contribute to 
the language being generated, and also to the elimination of unit and empty rules, 
in order to shorten the derivation of the sentences of the language. 

All the definitions and proof scripts presented in this paper were written in plain 
Coq and are available for download at: 
https://github.com/mvmramos/simplification 


2 Basic Definitions 

Context-free grammars were represented in Coq very closely to the usual algebraic
definition $G = (V, \Sigma, P, S)$, where $V$ is the vocabulary of $G$ (it includes all
non-terminal and terminal symbols), $\Sigma$ is the set of terminal symbols (used in the
construction of the sentences of the language generated by the grammar),
$N = V \setminus \Sigma$ is the set of non-terminal symbols (representing different
sentence abstractions), $P$ is the set of rules and $S \in N$ is the start symbol (also
called initial or root symbol). Rules have the form $\alpha \to \beta$, with
$\alpha \in N$ and $\beta \in V^*$.

Basic definitions in Coq are presented below. The $N$ and $\Sigma$ sets are represented
separately from $G$ (respectively by types non_terminal and terminal). The disjoint
union of the types non_terminal and terminal is represented by the symbol
+. Notations sf (sentential form) and sentence represent lists, possibly empty, of
respectively terminal and non-terminal symbols and terminal only symbols.

Variables non_terminal terminal: Type.
Notation sf := (list (non_terminal + terminal)).
Notation sentence := (list terminal).
Notation nlist:= (list non_terminal).

The record representation cfg has been used for G. The definition states that
cfg is a new type and contains three components. The first is the start_symbol of
the grammar (a non-terminal symbol) and the second is rules, which represents the
rules of the grammar. Rules are propositions (represented in Coq by Prop) that take
as arguments a non-terminal symbol and a (possibly empty) list of non-terminal and
terminal symbols (corresponding, respectively, to the left and right-hand side of a
rule). The third is rules_finite, a proof that the rules satisfy the predicate
rules_finite_def described next.

The predicate rules_finite_def assures that the set of rules of the grammar is
finite by proving that the length of the right-hand side of every rule is less than or
equal to a given value, and also that both the left and right-hand sides of the rules are
built from finite sets of, respectively, non-terminal and terminal symbols (represented
here by lists).

Definition rules_finite_def (ss: non_terminal)
                            (rules: non_terminal -> sf -> Prop)
                            (n: nat)
                            (ntl: list non_terminal)
                            (tl: list terminal) :=
  In ss ntl /\
  (forall left: non_terminal,
   forall right: list (non_terminal + terminal),
   rules left right ->
   length right <= n /\
   In left ntl /\
   (forall s: non_terminal, In (inl s) right -> In s ntl) /\
   (forall s: terminal, In (inr s) right -> In s tl)).

Record cfg: Type:= {
  start_symbol: non_terminal;
  rules: non_terminal -> sf -> Prop;
  rules_finite: exists n: nat,
                exists ntl: nlist,
                exists tl: tlist,
                rules_finite_def start_symbol rules n ntl tl }.

The decision of representing rules as propositions has the consequence that it
prevents executable code from being extracted from the formalization. It would surely
be desirable to be able to obtain certified algorithms for, in the present case, the
simplification of context-free grammars. The alternative then would be to represent
rules as a member of type list (non_terminal * sf) instead. This, however,
would have changed the whole declarative approach of the present work into a more
computational one, by creating functions that manipulate grammars that have the
desired properties. The purely logical approach, thus, was considered more appealing
and selected as the choice for the present formalization. In any case, this choice does
not affect the objectives listed in Section 1 and can be adapted in the future in order to
allow for code extraction, although this should demand a considerable effort in the
creation and proof of program-related scripts.

The example below represents a grammar that generates the language $a^*b$:

$G = (\{S', A, B, a, b\}, \{a, b\}, \{S' \to aS', S' \to b\}, S')$


The following are the definitions used to represent G in Coq (as g): 

Inductive non_terminal: Type:=
| S'
| A
| B.

Inductive terminal: Type:=
| a
| b.

Inductive rs: non_terminal -> sf -> Prop:=
  r1: rs S' [inr a; inl S']
| r2: rs S' [inr b].

Definition g: cfg _ _:= {|
  start_symbol:= S';
  rules:= rs;
  rules_finite:= rs_finite |}.

The term rs_finite (the proof that the set of rules of g is finite) is not presented
here, but can be easily constructed and is available from the link provided in Section 1.

Another fundamental concept used in this formalization is the idea of derivation:
a grammar g derives a string s2 from a string s1 if there exists a series of rules in g
that, when applied to s1, eventually result in s2. An inductive predicate definition
of this concept in Coq (derives) uses two constructors.


Inductive derives (g: cfg): sf -> sf -> Prop :=
| derives_refl: forall s: sf,
                derives g s s
| derives_step: forall s1 s2 s3: sf,
                forall left: non_terminal,
                forall right: sf,
                derives g s1 (s2 ++ inl left :: s3) ->
                rules g left right ->
                derives g s1 (s2 ++ right ++ s3).

The constructors of this definition (derives_refl and derives_step) are the
axioms of our theory. Constructor derives_refl asserts that every sentential form
s can be derived from s itself. Constructor derives_step states that if a sentential
form that contains the left-hand side of a rule is derived by a grammar, then the
grammar derives the sentential form with the left-hand side replaced by the right-hand
side of the same rule. This case corresponds to the application of a rule in a
direct derivation step.

A grammar generates a string if this string can be derived from its root symbol.
Finally, a grammar produces a sentence if it can be generated from its root symbol.

Definition generates (g: cfg) (s: sf): Prop:=
  derives g [inl (start_symbol g)] s.

Definition produces (g: cfg) (s: sentence): Prop:=
  generates g (map terminal_lift s).

Function terminal_lift converts a terminal symbol into a value of the sum type
(non_terminal + terminal). With these definitions, it has been possible to prove
various lemmas about grammars and derivations, and also operations on grammars,
all of which were useful when proving the main theorems of this article.

As an example, the lemma that states that $G$ produces the string $aab$ (that is,
that $aab \in L(G)$) is represented as:

Lemma G_produces_aab : 
produces G [ a; a; b]. 

The proof of this lemma can be easily constructed and relates directly to the
derivations in $S' \Rightarrow aS' \Rightarrow aaS' \Rightarrow aab$, however in reverse
order because of the way that derives is defined.
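The reverse-order construction can be illustrated with a minimal, self-contained sketch. It is not the script from the repository: the names nt, tm and example_derives_aab are assumptions made for this example, derives takes the rule set rs directly instead of a cfg record, and the goal is stated on already-lifted symbols rather than through produces.

```coq
Require Import List.
Import ListNotations.

(* Stand-ins for the paper's non_terminal and terminal types. *)
Inductive nt : Type := S' | A | B.
Inductive tm : Type := a | b.

Notation sf := (list (nt + tm)).

(* Rules of the example grammar: S' -> a S' and S' -> b. *)
Inductive rs : nt -> sf -> Prop :=
| r1 : rs S' [inr a; inl S']
| r2 : rs S' [inr b].

(* Same shape as the paper's derives, with the grammar argument dropped. *)
Inductive derives : sf -> sf -> Prop :=
| derives_refl : forall s : sf, derives s s
| derives_step : forall (s1 s2 s3 : sf) (lhs : nt) (rhs : sf),
    derives s1 (s2 ++ inl lhs :: s3) ->
    rs lhs rhs ->
    derives s1 (s2 ++ rhs ++ s3).

Lemma example_derives_aab : derives [inl S'] [inr a; inr a; inr b].
Proof.
  (* Last derivation step first: a a S' => a a b, using r2. *)
  apply (derives_step [inl S'] [inr a; inr a] [] S' [inr b]).
  - simpl.
    (* Middle step: a S' => a a S', using r1. *)
    apply (derives_step [inl S'] [inr a] [] S' [inr a; inl S']).
    + simpl.
      (* First step: S' => a S', using r1. *)
      apply (derives_step [inl S'] [] [] S' [inr a; inl S']).
      * simpl.
        apply derives_refl.
      * exact r1.
    + exact r1.
  - exact r2.
Qed.
```

Because derives_step extends a derivation at its end, the outermost tactic corresponds to the last step of the derivation and derives_refl closes the innermost goal.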


3 Methodology 

This formalization is about the definition of a new context-free grammar from a
previous one, such that (i) both grammars generate the same language and (ii)
the new grammar is free of a certain kind of symbols or rules. For all the four cases
considered, the following common approach has been adopted:

(i) Depending on the case, inductively define a new type of non-terminal symbols; 
this will be important, for example, when we want to guarantee that the start 
symbol of the grammar does not appear in the right-hand side of any rule or 
when we have to construct new non-terminal symbols from the existing ones; 

(ii) Inductively define the rules of the new grammar, in a way that allows the
construction of the proofs that the resulting grammar has the required properties;
these new rules will likely make use of the new non-terminal symbols described 
above; 

(iii) Define the new grammar by using the new non-terminal symbols and the new 
rules; define the new start symbol (which might be a new non-terminal symbol 
or an existing one) and build a proof of the finiteness of the set of rules for this 
new grammar; 

(iv) State and prove all the lemmas and theorems that will assert that the newly 
defined grammar has the desired properties. 

In the following section, this approach will be explored with further detail for
each main result achieved in this work.

4 Simplification


The definition of a context-free grammar allows for the inclusion of symbols and
rules that might not contribute to the language being generated. Context-free
grammars might also contain sets of rules that can be substituted by equivalent
smaller and simpler sets of rules. Unit rules, for example, do not expand sentential
forms (instead, they just rename the symbols in them) and empty rules can cause
them to contract. Although the appropriate use of these features can be important
for human communication in some situations, this is not the general case, since it
leads to grammars that have more symbols and rules than necessary, making their
comprehension and manipulation more difficult. Thus, simplification is an important
operation on context-free grammars.

Let $G$ be a context-free grammar, $L(G)$ the language generated by this grammar
and $\epsilon$ the empty string. Different authors use different terminology when presenting
simplification results for context-free grammars. In what follows, we adopt the
terminology and definitions of [33].

Context-free grammar simplification comprises four kinds of objects, whose
definitions and results are presented below:

(i) An empty rule $r \in P$ is a rule whose right-hand side $\beta$ is empty (e.g.
$X \to \epsilon$). We formalize that for all $G$, there exists $G'$ such that
$L(G) = L(G')$ and $G'$ has no empty rules, except for a single rule $S \to \epsilon$
if $\epsilon \in L(G)$; in this case, $S$ (the initial symbol of $G'$) does not appear
in the right-hand side of any rule in $G'$;

(ii) A unit rule $r \in P$ is a rule whose right-hand side $\beta$ contains a single
non-terminal symbol (e.g. $X \to Y$). We formalize that for all $G$, there exists $G'$
such that $L(G) = L(G')$ and $G'$ has no unit rules;

(iii) $s \in V$ is useful ([33], p. 116) if it is possible to derive a string of terminal
symbols from it using the rules of the grammar. Otherwise $s$ is called a useless
symbol. A useful symbol $s$ is one such that $s \Rightarrow^* \omega$, with
$\omega \in \Sigma^*$. Naturally, this definition concerns mainly non-terminals, as
terminals are trivially useful. We formalize that, for all $G$ such that
$L(G) \neq \emptyset$, there exists $G'$ such that $L(G) = L(G')$ and $G'$ has no
useless symbols;

(iv) $s \in V$ is accessible ([33], p. 119) if it is part of at least one string generated
from the root symbol of the grammar. Otherwise, it is called an inaccessible
symbol. An accessible symbol $s$ is one such that $S \Rightarrow^* \alpha s \beta$, with
$\alpha, \beta \in V^*$. We formalize that for all $G$, there exists $G'$ such that
$L(G) = L(G')$ and $G'$ has no inaccessible symbols.

Finally, we formalize a unification result: that for all $G$, if $G$ is non-empty, then
there exists $G'$ such that $L(G) = L(G')$ and $G'$ has no empty rules (except for one, if
$G$ generates the empty string), no unit rules, no useless symbols and no inaccessible
symbols.

In all these four cases and five grammars that are discussed next (namely g_emp,
g_emp', g_unit, g_use and g_acc), the proof of the predicate rules_finite is based
on the proof of the corresponding predicate for the argument grammar. Thus, all
new grammars satisfy the cfg specification and are finite as well.


4.1 Empty rules

Result (i) is achieved in two steps. First, the idea of a nullable symbol was
represented by the definition empty:

Definition empty
  (g: cfg _ _) (s: non_terminal + terminal): Prop:=
  derives g [s] [].

Notation sf' represents a sentential form built with symbols from non_terminal'
and terminal. Definition symbol_lift maps a value of type (non_terminal +
terminal) into a value of type (non_terminal' + terminal) by replacing each
non_terminal with the corresponding non_terminal':

Inductive non_terminal': Type:=
| Lift_nt: non_terminal -> non_terminal'
| New_ss.

Notation sf' := (list (non_terminal' + terminal)).

Definition symbol_lift
  (s: non_terminal + terminal): non_terminal' + terminal:=
  match s with
  | inr t => inr t
  | inl n => inl (Lift_nt n)
  end.


With these, a new grammar g_emp g has been created, such that the language 
generated by it matches the language generated by the original grammar (g), except 
for the empty string. Predicate g_emp_rules states that every non-empty rule of g 
is also a rule of g_emp g, and also adds new rules to g_emp g where every possible 
combination of nullable non-terminal symbols that appears in the right-hand side of 
a rule of g is removed, as long as the resulting right-hand side is not empty. Finally, 
it adds a rule that maps a new symbol, the start symbol of the new grammar 
(New_ss), to the start symbol of the original grammar. For this reason, the new 
type non_terminal' has been defined. The motivation for introducing a new start
symbol at this point is to be able to prove that the start symbol does not appear in 
the right-hand side of any rule of the new grammar, a result that will be important 
in future developments. 

Inductive g_emp_rules (g: cfg _ _): non_terminal' -> sf' -> Prop :=
| Lift_direct:
    forall left: non_terminal,
    forall right: sf,
    right <> [] -> rules g left right ->
    g_emp_rules g (Lift_nt left) (map symbol_lift right)
| Lift_indirect:
    forall left: non_terminal,
    forall right: sf,
    g_emp_rules g (Lift_nt left) (map symbol_lift right) ->
    forall s1 s2: sf,
    forall s: non_terminal,
    right = s1 ++ (inl s) :: s2 ->
    empty g (inl s) ->
    s1 ++ s2 <> [] ->
    g_emp_rules g (Lift_nt left) (map symbol_lift (s1 ++ s2))
| Lift_start_emp:
    g_emp_rules g New_ss [inl (Lift_nt (start_symbol g))].

Definition g_emp (g: cfg non_terminal terminal):
  cfg non_terminal' terminal := {|
  start_symbol:= New_ss;
  rules:= g_emp_rules g;
  rules_finite:= g_emp_finite g |}.

Suppose, for example, that $S$, $A$, $B$, $C$ are non-terminals, of which $A$, $B$ and $C$
are nullable, $a$, $b$ and $c$ are terminals and $X \to aAbBcC$ is a rule of g. Then, the
above definitions assert that $X \to aAbBcC$ is a rule of g_emp g, and also:

• $X \to aAbBc$;
• $X \to abBcC$;
• $X \to aAbcC$;
• $X \to aAbc$;
• $X \to abBc$;
• $X \to abcC$;
• $X \to abc$.
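The enumeration above can be reproduced computationally. The following sketch is only an illustration and not part of the formalization, which defines g_emp_rules as a proposition; the function drop_nullable, the type sym and the boolean test is_nullable are assumed names. The function lists every right-hand side obtained by independently keeping or dropping each nullable symbol.

```coq
Require Import List.
Import ListNotations.

(* All variants of l in which each element satisfying [nullable]
   is either kept or dropped. The unchanged list is included. *)
Fixpoint drop_nullable {X : Type} (nullable : X -> bool) (l : list X)
  : list (list X) :=
  match l with
  | [] => [[]]
  | x :: t =>
      let rest := drop_nullable nullable t in
      let kept := map (cons x) rest in
      if nullable x then kept ++ rest else kept
  end.

(* Symbols of the example rule X -> aAbBcC. *)
Inductive sym : Type := a | A | b | B | c | C.

(* Assumption of the example: A, B and C are nullable. *)
Definition is_nullable (s : sym) : bool :=
  match s with
  | A | B | C => true
  | _ => false
  end.

(* Three nullable symbols give 2^3 = 8 right-hand sides:
   the original one plus the seven listed above. *)
Example eight_variants :
  length (drop_nullable is_nullable [a; A; b; B; c; C]) = 8.
Proof.
  reflexivity.
Qed.

(* Displays the eight right-hand sides. *)
Compute drop_nullable is_nullable [a; A; b; B; c; C].
```

In this example no variant is empty because a, b and c are always kept. In general the empty variant would have to be filtered out, which is what the side condition s1 ++ s2 <> [] of Lift_indirect expresses.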

Observe that grammar g_emp g does not generate the empty string. The second
step, thus, was to define g_emp' g, such that g_emp' g generates the empty string if
g generates the empty string. This was done by stating that every rule from g_emp
g is also a rule of g_emp' g and also by adding a new rule that allows g_emp' g to
generate the empty string directly if necessary.

Inductive g_emp'_rules (g: cfg _ _):
  non_terminal' non_terminal -> sf' -> Prop :=
| Lift_all:
    forall left: non_terminal',
    forall right: sf',
    rules (g_emp g) left right ->
    g_emp'_rules g left right
| Lift_empty:
    empty g (inl (start_symbol g)) ->
    g_emp'_rules g (start_symbol (g_emp g)) [].

Definition g_emp' (g: cfg non_terminal terminal):
  cfg (non_terminal' _) terminal := {|
  start_symbol:= New_ss _;
  rules:= g_emp'_rules g;
  rules_finite:= g_emp'_finite g |}.

Note that the generation of the empty string by g_emp' g depends on g generating
the empty string.

The proof of the correctness of these definitions is achieved through the following 
theorem: 

Theorem g_emp'_correct:
  forall g: cfg non_terminal terminal,
  g_equiv (g_emp' g) g /\
  (generates_empty g -> has_one_empty_rule (g_emp' g)) /\
  (~ generates_empty g -> has_no_empty_rules (g_emp' g)) /\
  start_symbol_not_in_rhs (g_emp' g).

Four auxiliary predicates have been used in this statement: g_equiv for two
context-free grammars that generate the same language, generates_empty for a
grammar whose language includes the empty string, has_one_empty_rule for a
grammar that has an empty rule whose left-hand side is the initial symbol, and
all other rules are not empty and has_no_empty_rules for a grammar that has no
empty rules at all.

The definition of g_equiv is straightforward: 

Variables non_terminal non_terminal' terminal: Type.

Definition g_equiv (g1: cfg non_terminal terminal)
                   (g2: cfg non_terminal' terminal): Prop:=
  forall s: sentence,
  produces g1 s <-> produces g2 s.

When applied to the previous theorem, it translates into:

forall s: sentence,
produces (g_emp' g) s <-> produces g s.

For the -> part, the strategy adopted is to prove that for every rule
$left \to_{g\_emp'} right$ of (g_emp' g), either $left \to_g right$ is a rule of g or
$left \Rightarrow^*_g right$ in g. For the <- part, the strategy is a more complicated
one, and involves induction over the number of derivation steps in g.


4.2 Unit rules 

For result (ii), definition unit expresses the relation between any two non-terminal
symbols $X$ and $Y$, and is true when $X \Rightarrow^{+} Y$ through the application of
unit rules only.

Inductive unit (g: cfg _ _) (a: non_terminal): non_terminal -> Prop:=
| unit_rule: forall (b: non_terminal),
             rules g a [inl b] -> unit g a b
| unit_trans: forall b c: non_terminal,
              unit g a b ->
              unit g b c ->
              unit g a c.

Grammar g_unit g represents the grammar whose unit rules have been substituted
by equivalent ones. The idea is that g_unit g has all non-unit rules of g, plus
new rules that are created by anticipating the possible application of unit rules in
g, as informed by unit.

Inductive g_unit_rules (g: cfg _ _): non_terminal -> sf -> Prop :=
| Lift_direct':
    forall left: non_terminal,
    forall right: sf,
    (forall r: non_terminal,
     right <> [inl r]) -> rules g left right ->
    g_unit_rules g left right
| Lift_indirect':
    forall a b: non_terminal,
    unit g a b ->
    forall right: sf,
    rules g b right ->
    (forall c: non_terminal,
     right <> [inl c]) ->
    g_unit_rules g a right.

Definition g_unit (g: cfg _ _): cfg _ _ := {|
  start_symbol:= start_symbol g;
  rules:= g_unit_rules g;
  rules_finite:= g_unit_finite g |}.

Finally, the correctness of g_unit comes from the following theorem:

Theorem g_unit_correct:
  forall g: cfg _ _,
  g_equiv (g_unit g) g /\
  has_no_unit_rules (g_unit g).

The predicate has_no_unit_rules states that the argument grammar has no 
unit rules at all. 

Similar to the previous case, for the -> part of the g_equiv (g_unit g) g proof,
the strategy adopted is to prove that for every rule $left \to_{g\_unit} right$ of
(g_unit g), either $left \to_g right$ is a rule of g or $left \Rightarrow^*_g right$ in g.
For the <- part, the strategy is also a more complicated one, and involves induction
over a predicate that is isomorphic to derives (derives3), but generates the sentence
directly without considering the application of a sequence of rules, which allows one to
abstract the application of unit rules in g.
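How the unit relation feeds the new rule set can be seen on a toy grammar. The sketch below is self-contained and simplified: the grammar (X -> Y, Y -> Z, Z -> a) is hypothetical, the relation is named unit_rel here to avoid shadowing Coq's standard unit type, and the rule predicates take no grammar argument, unlike the definitions above.

```coq
Require Import List.
Import ListNotations.

Inductive nt : Type := X | Y | Z.
Inductive tm : Type := a.

Notation sf := (list (nt + tm)).

(* Hypothetical rules: X -> Y, Y -> Z (both unit rules) and Z -> a. *)
Inductive rs : nt -> sf -> Prop :=
| rXY : rs X [inl Y]
| rYZ : rs Y [inl Z]
| rZa : rs Z [inr a].

(* Counterpart of the paper's unit, specialised to rs. *)
Inductive unit_rel : nt -> nt -> Prop :=
| unit_rule : forall x y : nt, rs x [inl y] -> unit_rel x y
| unit_trans : forall x y z : nt,
    unit_rel x y -> unit_rel y z -> unit_rel x z.

(* Counterpart of g_unit_rules, specialised to rs. *)
Inductive unit_free_rs : nt -> sf -> Prop :=
| Lift_direct' : forall (lhs : nt) (rhs : sf),
    (forall r : nt, rhs <> [inl r]) ->
    rs lhs rhs ->
    unit_free_rs lhs rhs
| Lift_indirect' : forall x y : nt,
    unit_rel x y ->
    forall rhs : sf,
    rs y rhs ->
    (forall c : nt, rhs <> [inl c]) ->
    unit_free_rs x rhs.

(* X reaches Z through two unit rules. *)
Lemma unit_X_Z : unit_rel X Z.
Proof.
  apply unit_trans with (y := Y).
  - apply unit_rule.
    exact rXY.
  - apply unit_rule.
    exact rYZ.
Qed.

(* Hence the non-unit rule Z -> a is anticipated as X -> a. *)
Lemma X_to_a : unit_free_rs X [inr a].
Proof.
  apply Lift_indirect' with (y := Z).
  - exact unit_X_Z.
  - exact rZa.
  - intros c H.
    discriminate H.
Qed.
```

The last subgoal shows the role of the side condition: the right-hand side [inr a] differs from any single non-terminal, so the lifted rule is guaranteed not to be a unit rule.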


4.3 Useless symbols

For result (iii), the idea of a useful symbol is captured by the definition useful.

Definition useful (g: cfg _ _) (s: non_terminal + terminal): Prop:=
  match s with
  | inr t => True
  | inl n => exists s: sentence, derives g [inl n] (map term_lift s)
  end.

The removal of useless symbols comprises, first, the identification of useless
symbols in the grammar and, second, the elimination of the rules that use them.
Definition g_use_rules selects, from the original grammar, only the rules that do not
contain useless symbols. The new grammar, without useless symbols, can then be
defined as in g_use.

Inductive g_use_rules (g: cfg): non_terminal -> sf -> Prop :=
| Lift_use: forall left: non_terminal,
            forall right: sf,
            rules g left right ->
            useful g (inl left) ->
            (forall s: non_terminal + terminal, In s right ->
             useful g s) -> g_use_rules g left right.

Definition g_use (g: cfg _ _): cfg _ _:= {|
  start_symbol:= start_symbol g;
  rules:= g_use_rules g;
  rules_finite:= g_use_finite g |}.

The g_use definition, of course, can only be used if the language generated 
by the original grammar is not empty, that is, if the root symbol of the original 
grammar is useful. If it were useless then it would be impossible to assign a root 
to the grammar and the language would be empty. The correctness of the useless 
symbol elimination operation can be certified by proving theorem g_use_correct, 
which states that every context-free grammar whose root symbol is useful generates 
a language that can also be generated by an equivalent context-free grammar whose 
symbols are all useful. 

Theorem g_use_correct:
  forall g: cfg _ _,
  non_empty g ->
  g_equiv (g_use g) g /\
  has_no_useless_symbols (g_use g).

The predicates non_empty and has_no_useless_symbols used above assert,
respectively, that grammar g generates a language that contains at least one string
(which in turn may or may not be empty) and that the grammar has no useless symbols
at all.

The -> part of the g_equiv proof is straightforward, since every rule of g_use is 
also a rule of g. For the converse, it is necessary to show that every symbol used in 
a derivation of g is useful, and thus the rules used in this derivation also appear in 
g_use. 
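A proof of usefulness amounts to exhibiting a sentence and a derivation of it. The sketch below is a simplified illustration under stated assumptions: the grammar (S0 -> A b, S0 -> a, A -> A a) is hypothetical, the names nt, tm and S0_useful are made up for the example, and derives and useful take no grammar argument.

```coq
Require Import List.
Import ListNotations.

Inductive nt : Type := S0 | A.
Inductive tm : Type := a | b.

Notation sf := (list (nt + tm)).
Notation sentence := (list tm).

(* Hypothetical rules: S0 -> A b, S0 -> a, A -> A a. *)
Inductive rs : nt -> sf -> Prop :=
| r1 : rs S0 [inl A; inr b]
| r2 : rs S0 [inr a]
| r3 : rs A [inl A; inr a].

Inductive derives : sf -> sf -> Prop :=
| derives_refl : forall s : sf, derives s s
| derives_step : forall (s1 s2 s3 : sf) (lhs : nt) (rhs : sf),
    derives s1 (s2 ++ inl lhs :: s3) ->
    rs lhs rhs ->
    derives s1 (s2 ++ rhs ++ s3).

Definition term_lift (t : tm) : nt + tm := inr t.

(* Counterpart of the paper's useful, specialised to rs. *)
Definition useful (s : nt + tm) : Prop :=
  match s with
  | inr _ => True
  | inl n => exists w : sentence, derives [inl n] (map term_lift w)
  end.

(* S0 is useful: the sentence "a" is derived from it with r2. *)
Lemma S0_useful : useful (inl S0).
Proof.
  unfold useful.
  exists [a].
  change (derives [inl S0] [inr a]).
  apply (derives_step [inl S0] [] [] S0 [inr a]).
  - simpl.
    apply derives_refl.
  - exact r2.
Qed.
```

In this toy grammar the only rule for A is A -> A a, so no string of terminals can be derived from A. A is therefore useless, and a rule set built like g_use_rules would keep r2 while discarding r1 and r3.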


4.4 Inaccessible symbols

Result (iv) is similar to the previous case, and definition accessible has been used 
to represent accessible symbols in context-free grammars. 

Definition accessible
  (g: cfg _ _) (s: non_terminal + terminal): Prop:=
  exists s1 s2: sf, derives g [inl (start_symbol g)] (s1 ++ s :: s2).

Definition g_acc_rules selects, from the original grammar, only the rules that 
do not contain inaccessible symbols. Definition g_acc represents a grammar whose 
inaccessible symbols have been removed. 

Inductive g_acc_rules (g: cfg): non_terminal -> sf -> Prop :=
| Lift_acc: forall left: non_terminal,
            forall right: sf,
            rules g left right -> accessible g (inl left) ->
            g_acc_rules g left right.

Definition g_acc (g: cfg _ _): cfg _ _ := {|
  start_symbol:= start_symbol g;
  rules:= g_acc_rules g;
  rules_finite:= g_acc_finite g |}.

The correctness of the inaccessible symbol elimination operation can be certified
by proving theorem g_acc_correct, which states that every context-free grammar
generates a language that can also be generated by an equivalent context-free grammar
whose symbols are all accessible.

Theorem g_acc_correct:
  forall g: cfg _ _,
  g_equiv (g_acc g) g /\
  has_no_inaccessible_symbols (g_acc g).

In a way similar to has_no_useless_symbols, the absence of inaccessible symbols
in a grammar is expressed by predicate has_no_inaccessible_symbols used
above.

Similar to the previous case, the -> part of the g_equiv proof is also straightforward,
since every rule of g_acc is also a rule of g. For the converse, it is necessary
to show that every symbol used in a derivation of g is accessible, and thus the rules
used in this derivation also appear in g_acc.
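Accessibility is witnessed by a sentential form, derived from the start symbol, in which the symbol occurs. The following self-contained sketch illustrates this on a hypothetical grammar (S0 -> a A, A -> b, B -> a). The names nt, tm and A_accessible are assumptions, and the grammar argument is again omitted.

```coq
Require Import List.
Import ListNotations.

Inductive nt : Type := S0 | A | B.
Inductive tm : Type := a | b.

Notation sf := (list (nt + tm)).

(* Hypothetical rules: S0 -> a A, A -> b, B -> a. *)
Inductive rs : nt -> sf -> Prop :=
| r1 : rs S0 [inr a; inl A]
| r2 : rs A [inr b]
| r3 : rs B [inr a].

Inductive derives : sf -> sf -> Prop :=
| derives_refl : forall s : sf, derives s s
| derives_step : forall (s1 s2 s3 : sf) (lhs : nt) (rhs : sf),
    derives s1 (s2 ++ inl lhs :: s3) ->
    rs lhs rhs ->
    derives s1 (s2 ++ rhs ++ s3).

(* Counterpart of the paper's accessible, with S0 as start symbol. *)
Definition accessible (s : nt + tm) : Prop :=
  exists s1 s2 : sf, derives [inl S0] (s1 ++ s :: s2).

(* A occurs in the sentential form "a A", obtained with r1. *)
Lemma A_accessible : accessible (inl A).
Proof.
  unfold accessible.
  exists [inr a], [].
  simpl.
  apply (derives_step [inl S0] [] [] S0 [inr a; inl A]).
  - simpl.
    apply derives_refl.
  - exact r1.
Qed.
```

B does not occur in the right-hand side of any rule of S0 or A, so it never appears in a sentential form derived from S0. A rule set built like g_acc_rules would therefore keep r1 and r2 and drop r3.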
4.5 Unification


If one wants to obtain a new grammar simultaneously free of empty and unit rules,
and of useless and inaccessible symbols, it is not enough to consider the previous
independent results. Instead, it is necessary to establish a suitable order
to apply these simplifications, in order to guarantee that the final result satisfies all
desired conditions. Then, it is necessary to prove that the claims do hold.

For the order, we should start with (i) the elimination of empty rules, followed
by (ii) the elimination of unit rules. The reason for this is that (i) might introduce
new unit rules in the grammar, and (ii) will surely not introduce empty rules, as long
as the original grammar is free of them (except for $S \to \epsilon$, in which case $S$,
the initial symbol of the grammar, must not appear in the right-hand side of any rule).
Then, elimination of useless and inaccessible symbols (in either order) is the right thing
to do, since they only remove rules from the original grammar (which is especially
important because they do not introduce new empty or unit rules).
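The need to eliminate empty rules before unit rules can be seen on a small worked example. The grammar below is hypothetical and not taken from the formalization, and the rule for the new start symbol introduced by g_emp is left out for brevity.

```latex
% Hypothetical rule set P, in which B is nullable.
% P' is obtained by empty-rule elimination: B may be dropped from X -> A B.
% P'' is obtained from P' by unit-rule elimination: X -> A is replaced by X -> a.
\begin{align*}
  P   &= \{\, X \to A B,\; A \to a,\; B \to b,\; B \to \epsilon \,\} \\
  P'  &= \{\, X \to A B,\; X \to A,\; A \to a,\; B \to b \,\} \\
  P'' &= \{\, X \to A B,\; X \to a,\; A \to a,\; B \to b \,\}
\end{align*}
```

The rule $X \to A$ in $P'$ is a unit rule that was not present in $P$, so unit rules must be eliminated after empty rules. The second step adds no empty rule, because every rule it creates copies the right-hand side of an existing non-unit rule.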

The formalization of this result is captured in the following theorem, which 
represents the main result of this work: 

Theorem g_simpl_exists_v1:
  forall g: cfg non_terminal terminal,
  non_empty g ->
  exists g': cfg (non_terminal' non_terminal) terminal,
  g_equiv g' g /\
  has_no_inaccessible_symbols g' /\
  has_no_useless_symbols g' /\
  (generates_empty g -> has_one_empty_rule g') /\
  (~ generates_empty g -> has_no_empty_rules g') /\
  has_no_unit_rules g' /\
  start_symbol_not_in_rhs g'.

Hypothesis non_empty g is necessary in order to allow the elimination of useless
symbols. The predicate start_symbol_not_in_rhs states that the start symbol
does not appear in the right-hand side of any rule of the argument grammar.

The proof of g_simpl_exists_v1 demands auxiliary lemmas to prove that the
characteristics of the initial transformations are preserved by the following ones. For
example, unit rules elimination, useless symbol elimination and inaccessible symbol
elimination operations preserve the characteristics of the empty rules elimination
operation.

The proofs of all lemmas and theorems presented in this article have been formalized
in Coq and comprise approximately 10,000 lines of scripts. This number
can be explained by the following reasons:

(i) The style adopted for writing the scripts: for the sake of clarity, each tactic
is placed on its own line, despite the possibility of combining several tactics on
the same line. Also, bullets (for structuring the code) were used as much as
possible and the sequence tactical (using the semicolon symbol) was avoided
altogether. This duplicates parts of the code but has the advantage of keeping the
static structure of the script related to its dynamic behaviour, which favors
legibility and maintenance.

(ii) The formalization includes not only the main theorems described here, but also
an extensive library of other fundamental and auxiliary lemmas on context-free
grammars and derivations, which have been used to obtain the main results
presented here, were used in the previously obtained results and will be used
in future developments.


5 Further Work 

Current work has focussed on the representation of context-free grammars, context-free
derivations, the formalization of grammar simplification strategies and the
certification of their correctness. It represents an important step towards the
formalization of context-free language theory, and adds to the previous results on the
formalization of closure properties for context-free grammars ([30]).

The next steps of this formalization work are:

(i) Describe Chomsky normal form for context-free grammars and prove its
existence for any context-free grammar that satisfies the required conditions;

(ii) Obtain a formal proof of the Pumping Lemma for context-free languages.

The second objective relies on the first one, while the first depends directly on
the results presented here.

6 Related Work 

Language and automata theory has been a subject of formalization since the mid-1980s,
when Kreitz used the Nuprl proof assistant to prove results about deterministic
finite automata and the pumping lemma for regular languages [25]. Since then,
the theory of regular languages has been formalized partially by different researchers
using different proof assistants (see [11], [22], [16], [10], [26], [27], [2], [1], [28], [8],
[9], [3], [13], [24] and [34]). The most recent and complete formalization, however,
is the work by Jan-Oliver Kaiser [14], which used Coq and the SSReflect extension
to prove the main results of regular language theory.

Context-free language theory has not been formalized to the same extent so far,
and the results were obtained with a diversity of proof assistants, including Coq,
HOL4 and Agda. Most of the effort started in 2010 and has been devoted to the
certification and validation of parser generators. Examples of this are the works
of Koprowski and Binsztok (using Coq, [23]), Ridge (using HOL4, [32]), Jourdan,
Pottier and Leroy (using Coq, [21]) and, more recently, Firsov and Uustalu (in Coq,
[17]).

On the more theoretical side, on which the present work should be considered,
Norrish and Barthwal (using HOL4, [4], [5], [6]) published on general context-free
language theory formalization, including the existence of normal forms for grammars,
pushdown automata and closure properties. Recently, Firsov and Uustalu proved
the existence of a Chomsky Normal Form grammar for every general context-free
grammar (using Agda, [18]).

It can thus be noted that apparently no formalization has been done in Coq so
far for results not related directly to parsing and parser verification, and that this
constitutes an important motivation for the present work, mainly due to the increasing
usage and importance of Coq in different areas and communities. Specifically,
the formalization done by Norrish and Barthwal in HOL4 is quite comprehensive
and goes beyond our work by covering the Greibach Normal Form, pushdown automata
and their relation to context-free grammars. It does not include, however, a proof of
either the decidability of the membership problem or the Pumping Lemma for context-free
languages, which are objectives of the present work. The formalization by Firsov
and Uustalu in Agda comprises basically the existence of a Chomsky Normal Form,
and formalizes the elimination of empty and unit rules, but not the elimination of useless
and inaccessible symbols.

When it comes to computability theory and Turing machines related classes of 
languages, formalization has been approached by Asperti and Ricciotti (Matita, [3]), 
Xu, Zhang and Urban (Isabelle/HOL, [35]) and Norrish (HOL4, [29]). 

7 Conclusions 

The present paper reports an ongoing effort towards formalizing the classical
context-free language theory, initially based only on context-free grammars, in the Coq
proof assistant. All important objects have been formalized and different simplification
strategies on grammars have been implemented. Proofs of their correctness were
successfully constructed.

Building up on the previous formalization of closure properties for context-free 
grammars [30], the present results create a comfortable situation in order to pursue 
the formalization of normal forms for context-free grammars, the next step of this 
work. 

The authors acknowledge the fruitful discussions with Nelma Moreira (Departamento
de Ciencia de Computadores da Faculdade de Ciencias da Universidade
do Porto, Portugal), Jose Carlos Bacelar Almeida (Departamento de Informatica
da Universidade do Minho, Portugal) and Arthur Azevedo de Amorim (University of
Pennsylvania) as well as their contributions to this work. Also, we are grateful to the
anonymous reviewers who provided useful criticisms and insights, and contributed
to a better presentation of this work.